DISCOVERY OF THE D-BASIS IN BINARY TABLES BASED ON
HYPERGRAPH DUALIZATION 

K. ADARICHEVA AND J. B. NATION 


Abstract. Discovery of (strong) association rules, or implications, is an important
task in data management, and it finds application in artificial intelligence,
data mining and the semantic web. We introduce a novel approach for the discovery
of a specific set of implications, called the D-basis, that provides a
representation for a reduced binary table, based on the structure of its Galois
lattice. At the core of the method are the D-relation defined in the lattice
theory framework, and the hypergraph dualization algorithm that allows us to
effectively produce the set of transversals for a given Sperner hypergraph. The
latter algorithm, first developed by specialists from Rutgers Center for
Operations Research, has already found numerous applications in solving
optimization problems in database theory, artificial intelligence and game
theory. One application of the method is for analysis of gene expression data
related to a particular phenotypic variable, and some initial testing is done
for the data provided by the University of Hawaii Cancer Center.


1. Introduction 

Knowledge retrieval from large data sets remains an essential problem in
information technology and its many usages in finance, biology, economics and social
sciences. The data is often recorded in binary tables with rows consisting of the
objects and columns of the attributes, with entries that mark whether a particular
object has or does not have a particular attribute. The dependencies existing between
the subsets of the attributes in the form of association rules, or implications, can
uncover the laws, causalities and trends hidden in the data; see R. Agrawal et al. [1].

An implication X → Y (also referred to as a strong association rule in data
mining, see M. Kryszkiewicz [28], with the parameter of confidence equal to 1) has
the meaning that every object in the data set that possesses each attribute in the
subset X will also possess every attribute from the set Y. Such implications provide
an essential hidden connection between different attributes. Sets of implications
which are called bases provide an alternative way of storing the tabled data: they
generate all possible implications that hold in the set of attributes, as their logical
consequences, and also allow the restoration of the tabled data. Sometimes, only the
implications X → b pertinent to a distinguished attribute b could be of particular
importance in various experiments that produce tabled data.

One particular type of a table could be medical data where b may stand for 
a phenotypic attribute, while other attributes are the expression levels of specific 
genes. Then each association rule may represent a specific hypothesis of how the 
expression of several genes directly impacts the expression of another gene, and via a
chain of direct interactions, indirectly impacts the phenotypic variation. Biologists 
may then pick some of these hypotheses to verify the interactions with the inventory 
of biochemical methods or animal models. To discriminate between the rules that 
are biologically relevant and those that are not, the concept of the support of the 
rule X → b may be applied, which is the portion of the rows of the table where all
attributes from X are present. The rules with the highest support could be the 
targets of prime interest. 
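
The two notions used above, an implication X → b holding in a table and the support of such a rule, can be illustrated by a minimal sketch. The table, the attribute names g1, g2 and the helper names below are hypothetical and serve only as an illustration; they are not taken from the data discussed in this paper.

```python
# Hypothetical toy table (not from the paper): each row lists the attributes
# that the corresponding object possesses.
TABLE = [
    {"g1", "g2", "b"},
    {"g1", "b"},
    {"g1", "g2", "b"},
    {"g2"},
]

def rows_with(table, attrs):
    """Indices of the rows that contain every attribute in attrs."""
    return [i for i, row in enumerate(table) if set(attrs) <= row]

def holds(table, premise, target):
    """True if every row containing the premise also contains the target,
    i.e. the rule premise -> target has confidence 1.
    (Vacuously true when no row contains the premise.)"""
    return all(target in table[i] for i in rows_with(table, premise))

def support(table, premise):
    """Portion of the rows in which all attributes of the premise are present."""
    return len(rows_with(table, premise)) / len(table)
```

In this toy table the rule g1 → b holds with support 3/4, while g2 → b fails because of the last row.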

In this paper we propose a novel algorithm for the retrieval of the strong association
rules between the sets of attributes in large binary tables, which is particularly
suited for retrieval of targeted implications X → b with a fixed attribute b. At the
heart of the algorithm lies the connection between an arbitrary binary table and an
associated algebraic structure known as a Galois lattice. A generating set of strong
association rules, called a basis, often gives a more concise representation of the
tabled data and its Galois lattice. Nevertheless, the existing algorithms developed
in the framework of Formal Concept Analysis (FCA) are time-exponential in the
size of the table, and they become ineffective on large data sets.

Our approach retrieves a complete generating set of implications known as the
D-basis, that was introduced and tested in K. Adaricheva et al. [5]. This new type of
basis is amenable to parallel processing, and can be used to find the rules that target
a particular attribute without constructing the entire Galois lattice. Significantly, it
employs the powerful optimization algorithm known as hypergraph dualization,
see M. Fredman and L. Khachiyan [16]. A similar approach was used in U. Ryssel,
F. Distel and D. Borchmann [35] for the retrieval of the canonical direct unit basis,
of which the D-basis is a subset. The code implementations of the dualization
algorithm were successfully tested on large-scale models in E. Boros et
al. [9]. While one can employ the algorithm for arbitrary tables, it could be
especially effective in tables with the number of attributes considerably larger than
the number of the objects. One finds this type of data in biological studies, where the
set of attributes may include millions of genetic variations.

The paper is organized as follows. We present the association rules and the
parameters of support and confidence, as they are known in data mining, in section 2.
Then we give a general overview of Galois lattices and FCA methods in section 3,
and of the hypergraph dualization algorithm in section 4. Section 5 describes the
D-basis, and section 6 gives the general properties of Galois lattices. Our main
result is in section 7: an algorithm to recover the D-basis from a binary table. We
demonstrate the algorithm on an example in section 8, provide some preliminary
results of testing the code implementation in section 9, and give an overview of the
future work in section 10.

2. Association rules in data mining 

The leading algorithmic approach in recovery of association rules in data mining
was formulated in the form of the Apriori algorithm in R. Agrawal et al. [3]. Two
parameters are essential for association rules: the thresholds of the support and the
confidence. The support of any set X of attributes is the value S_A(X) (formally
defined in section 6.1), which is essentially the number of rows of the table that have
ones in all columns from X. The set X is called δ-frequent, if the ratio of S_A(X) to
the number of all rows in the table is greater than the given value of the threshold δ.
The first part of the Apriori algorithm aims at discovery of the maximal δ-frequent
sets, for a given threshold value δ.

The second part of the algorithm is devoted to splitting any maximal frequent
set X into Y ∪ Z so that the confidence S_A(Y ∪ Z)/S_A(Y) exceeds a given threshold
μ. In this case, Y → Z is an association rule that satisfies both thresholds δ and μ.
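
The splitting step can be sketched as follows. This is only an illustration of the confidence ratio S_A(Y ∪ Z)/S_A(Y) on a hypothetical five-row table, not an implementation of Apriori; the names `support_count`, `confidence` and `split_rules` are introduced here for the example.

```python
from itertools import combinations

# Hypothetical toy table: each row is the set of attributes of one object.
TABLE = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"a"}, {"c"}]

def support_count(table, attrs):
    """Number of rows that have ones in all columns from attrs."""
    return sum(1 for row in table if attrs <= row)

def confidence(table, Y, Z):
    """Ratio S_A(Y u Z) / S_A(Y); defined as 0.0 when no row contains Y."""
    denom = support_count(table, Y)
    return support_count(table, Y | Z) / denom if denom else 0.0

def split_rules(table, X, mu):
    """All splittings of X into nonempty Y and Z whose confidence exceeds mu."""
    items = sorted(X)
    rules = []
    for r in range(1, len(items)):
        for Y in combinations(items, r):
            Z = tuple(a for a in items if a not in Y)
            if confidence(table, set(Y), set(Z)) > mu:
                rules.append((Y, Z))
    return rules
```

For the frequent set {a, b} and μ = 0.8, only the rule b → a survives, since a → b has confidence 3/4.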

The time required to build maximal frequent sets, in the bottom-up way as in 
Apriori, is asymptotically proportional to the number of frequent sets, and the 
latter may be exponentially larger than the number of maximal frequent sets. The 
number of maximal frequent sets itself may also be exponential in the number of
attributes of the table. The time required to build maximal frequent sets can be
used as a norm to evaluate the complexity relative to the size of both input and 
output, in other words, the time delay in computation of the next maximal frequent 
set. 

According to a result of E. Boros et al. m, given any subset S of maximal 
frequent sets, the problem of deciding whether it is complete (i.e., includes all
maximal sets) is NP-hard, even if the number of maximal frequent sets is exponential
in the number of attributes n, and S is of size O(n^α) for small α.

All this makes the Apriori heuristic inadequate for large data sets, in particular 
with many frequent sets of large size. Another weakness of the Apriori approach is 
its inability to target specific attributes in the conclusion of association rules. Due to 
the nature of the algorithm, the decision about the splitting of a maximal frequent 
set into antecedent and conclusion happens in the second part of the algorithm, 
after the main effort on obtaining the frequent sets is already spent. See further 
details in section 


3. Formal Concept Analysis and basis retrieval 

3.1. Retrieval of the canonical basis. The path from the tabled data to frequent
sets of attributes, and then to bases of implications, can be built via the
Galois relation existing between objects and attributes, and via the structure
associated with this relation known in the algebraic literature as the Galois lattice.
For example, maximal frequent subsets of the attribute set, which are targeted in 
data mining of transaction tables, are particular elements of the Galois lattice, see 
P. Valtchev et al. [^, while the strong association rules (with the confidence = 1) 
form the implicational basis of this lattice. 

The first study of the Galois lattice appeared in 1940 in G. Birkhoff [8]. It was
developed further in M. Barbut and B. Monjardet [5], and taken into a field of
its own, under the name Formal Concept Analysis (or FCA), in B. Ganter and R.
Wille [H]. FCA has been in use as a technical tool since the 1980s, mostly for data
visualization in various business applications. The applications range from
knowledge representation and data mining to knowledge management and the semantic
web, see J. Poelmans et al. m- 

Most recently, there were several projects in European Commonwealth where 
FCA was involved in biological studies of temporal Boolean networks modeling 
genetic data. See, for example, [3 uni EH ES]- There are growing applications 
of Galois lattices in ontology mappings [371EH1E9] and in description logics [33]. 
A generalization of Galois lattices, known as “pattern structures” in B. Ganter 
and S. Kuznetsov m, deals with data more complex than binary tables. Some 
applications of these techniques can be found in C. Carpineto and G. Romano [12]
and M. Kaytoue et al. ^5] , 

The FCA approach is centered around a special implicational basis known in the 
literature as the canonical basis (also stem basis, or Duquenne-Guigues basis). This 
basis has the minimum number of implications, and has a fundamental connection 
to any other basis defining a given Galois lattice and/or binary table. 

The attribute exploration algorithm to recover the canonical basis, developed in
the FCA framework, see B. Ganter [15], requires a search over the entire Galois
lattice, with respect to some linear order established on the power set of attributes.
As a result, the algorithm runs in time dependent on at least the size of the
Galois lattice, which is normally exponential in the number of attributes. Recent
theoretical results suggest that existing approaches to calculation of the canonical 
basis may not produce algorithms with better worst-case complexity, see |S1[T1]. 

At its core, the implications of the canonical basis are defined by recognizing
pseudo-closed sets X that are placed into the premises of implications: X → Y.
There is no possibility to use parallel computation of pseudo-closed sets, since every
new one depends on its subsets being recognized earlier in the process.¹ Moreover,
it may happen that the same attribute b will appear on the right side in several
implications from the basis. Thus, even if we were interested only in implications
of the form X → b, for this particular b, one would need to reconstruct the whole
basis.

3.2. Canonical direct unit basis. The canonical direct unit basis is discussed
in K. Bertet and B. Monjardet [7] as a basis unifying various definitions given to
this concept in the literature. Like the Duquenne-Guigues basis, it can be defined
in the abstract framework of a general closure system on a finite set. It consists
of implications X → b with the property of minimality of X with respect to set
containment, for any fixed b. In other words, no implication X' → b holds, where
X' is a proper subset of X.
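
To make the minimality condition concrete, the following brute-force sketch lists, for a fixed attribute b, all minimal nonempty premises X such that X → b holds in a table. It is exponential in the number of attributes and is meant only to illustrate the definition; the table and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy table over the attributes a, b, c, d.
TABLE = [{"a", "b", "d"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]

def holds(table, premise, b):
    """True if every row containing the premise also contains b."""
    return all(b in row for row in table if premise <= row)

def minimal_premises(table, attributes, b):
    """Minimal nonempty sets X (not containing b) such that X -> b holds.

    Candidates are visited by increasing size, so a candidate is kept only
    if no previously found (smaller) premise is contained in it.
    """
    others = sorted(set(attributes) - {b})
    found = []
    for r in range(1, len(others) + 1):
        for cand in combinations(others, r):
            X = frozenset(cand)
            if any(Y <= X for Y in found):
                continue  # X is not minimal
            if holds(table, X, b):
                found.append(X)
    return found
```

In the toy table, the only minimal premise for d is {a, b}, and the only minimal premise for a is {d}.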

The canonical direct unit basis has considerably more implications compared to
the Duquenne-Guigues basis, but it has the nice feature of being an iteration-free
basis, or direct basis. In fact, it is contained in any other direct basis defining the
same closure system. Section 11 of [2] contains experimental results on the time needed
to compute the closure of a randomly chosen set, in the closure systems on 6 and 
7-element sets, given three bases: canonical basis of Duquenne-Guigues, in its unit 
form, the canonical direct unit basis and the D-basis. For the closure systems 
in these computational experiments, we found that the length of the Duquenne- 
Guigues basis was on the order of half the length of the canonical direct basis. On 
the other hand, not being direct or ordered direct, it took on the order of twice as 
long to compute the closure as the other two. 

Recently, U. Ryssel et al. [32] proposed a method of retrieval of the canonical
direct unit basis that would employ the hypergraph dualization algorithm. While
the actual coding applied some back-tracking and simple heuristics instead of an
actual implementation of dualization, the computer test results showed considerably
shorter running times when compared to existing FCA implementations: on a
particular data table of size 26 × 79 the new algorithm would return the basis of 86
implications in 0.1 sec compared to 6.5 hours of the FCA implementation, and, on
random tables of size 20 × 40, it spent about 1/50 of the time needed for the FCA
algorithm.

¹We mention that some efforts were spent on parallelization of algorithms building a Galois
lattice from the binary table, such as in P. Krajca et al. [27], but possible utilization of these
approaches for computation of pseudo-closed sets has not yet been developed.

We note that the D-basis which we deal with in this paper is a subset of the
canonical direct unit basis, see more details in section 5 and in [2].

4. Hypergraph dualization 

Let A be a finite set of cardinality |A| = n. For a hypergraph (set family) H ⊆
2^A, consider the family of its maximal independent subsets, i.e., maximal subsets
of A not containing any edge of H. The complement of a maximal independent
subset is a minimal transversal of H, i.e., a minimal subset of A intersecting all
edges of H. (Minimal transversals are also called minimal hitting sets.)

The collection H^d of minimal transversals is called the dual or transversal
hypergraph of H. It is easy to see that H^d is a Sperner hypergraph, i.e., no edge of
H^d contains another edge of H^d. If H is also Sperner then H = (H^d)^d. Given a
Sperner hypergraph, a frequently arising task is the generation of the transversal
hypergraph H^d. This problem, known as dualization, can be stated as follows:

DUAL(H, G): Given a complete list of all edges of H, and a set of minimal
transversals G ⊆ H^d, either prove that G = H^d or find a new transversal g ∈
H^d \ G.

Clearly, one can generate all of the minimal transversals in H^d (equivalently, all
the maximal independent sets for H) by initializing G = ∅ and iteratively solving
the above problem |H^d| + 1 times. Since |H^d| can be exponentially large in both
|H| and |A|, the complexity of generating H^d is customarily measured in the input
and output sizes.
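
For small hypergraphs, H^d can be computed directly. The sketch below uses the classical sequential method usually attributed to Berge (edges are processed one at a time and the current list of minimal transversals is updated); it is not the incremental algorithm for DUAL(H, G) discussed in this section and has no comparable complexity guarantee. The function name is an assumption made for this example.

```python
def minimal_transversals(edges):
    """Minimal transversals of a hypergraph given as an iterable of sets.

    Sequential (Berge-style) method: keep the minimal transversals of the
    edges seen so far and extend them to hit each new edge.
    """
    transversals = [frozenset()]
    for edge in edges:
        candidates = set()
        for t in transversals:
            if t & edge:
                candidates.add(t)              # t already hits the new edge
            else:
                for v in edge:                 # extend t by one vertex of the edge
                    candidates.add(t | {v})
        # keep only the inclusion-minimal candidates
        transversals = [t for t in candidates
                        if not any(other < t for other in candidates)]
    return set(transversals)
```

For the Sperner hypergraph H = {{1, 2}, {2, 3}} this gives H^d = {{2}, {1, 3}}, and applying the function to H^d returns H again, in line with the identity H = (H^d)^d.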

According to a result in [16], the problem DUAL(H, G) can be solved in
incremental quasi-polynomial time, i.e., in O(n) + m^{o(log m)} time, where n = |A| and
m = |H| + |G|. Moreover, H^d can be generated in incremental polynomial time
(i.e., DUAL(H, G) can be solved in time polynomial in |A|, |H|, and |G|) for many
classes of hypergraphs, see liiiniiis]. 

The hypergraph dualization algorithm is applied to various optimization problems in
database theory, artificial intelligence, game theory, and learning theory, to name a few. See
the survey articles Hg and [g, also the most complete recent account in (53]. Its 
efficient implementation is described and tested in L. Khachiyan et al. [2^. While 
it was shown that the implementation achieves the same theoretical worst bound, 
practical experience with this implementation shows that it can be substantially 
faster. In particular, the code can produce, in a few hours, millions of transversals 
for hypergraphs with hundreds of vertices and thousands of hyper-edges. Furthermore,
the experiments also indicate that the delay per transversal scales almost
linearly with the number of vertices and number of hyper-edges. A more recent
implementation by K. Murakami and T. Uno [30] not only demonstrates even better
time performance, but also runs fast on large-scale inputs, for which earlier
algorithms do not terminate in practical time.

5. D-basis 

The idea of the D-basis comes from the concept of the D-relation developed in a
lattice theoretic framework. The definition of this relation goes back to the work
of B. Jonsson, A. Day, R. Freese, and J.B. Nation, which showed that this relation
played a critical role in the description of the structure of free lattices. See the
monograph R. Freese et al. [18]. The D-relation plays a key role in defining the
OD-graph of a finite lattice in J.B. Nation [21], which was widely used in computer
science literature. This concept was translated into the D-basis in the recent work

m- 

Essentially, the D-basis is a subset of the canonical direct unit basis that consists
of implications x → b (binary part), as well as X → b with |X| > 1, such that
whenever any x ∈ X is replaced by any set Y for which x → y holds for all y ∈ Y,
and Y → x does not hold, the implication no longer holds. In particular, when
Y = ∅, this also means that the implication X' → b fails for every proper subset
X' ⊂ X.


For each implication X → b in the D-basis, the subset X is called a minimal
cover for b, in the lattice theory framework, and the D-relation is the binary relation
on the base set of a closure system defined as follows: bDx if x ∈ X, for some
minimal cover X for b.

As we will see in section 6, the closure system of interest for us will be defined
on the base set A' of attributes of a (reduced, per discussion in section 6.2) binary
table, which will also become the set of join irreducible elements of the Galois lattice
L. We will revisit the connection between the D-relation and the D-basis in section
6.5.

The OD-graph of a finite closure system or a finite lattice is defined as a collection 
of all minimal covers, if any, for join irreducible elements of the lattice, together 
with the partially ordered set of all join irreducibles in the lattice. The latter is 
reflected in the binary part of the D-basis, or any other basis for the lattice. 


Example 1. 

Consider the lattice L on Fig. 1 in section 8. The set of join irreducible elements
is Ji L = {a1, a2, c1, c2, b}. The OD-graph of the lattice contains (Ji L, ≤), where
≤ is inherited from the lattice, i.e., the non-trivial relations are c1 ≤ a1, b and
c2 ≤ a2, b. Thus, implications b → c1, a1 → c1 and a2 → c2, b → c2 should be
either included into or follow from any set of implications defining the closure system
represented by L.

The OD-graph will also contain the minimal covers, as pairs (X, y), where X ⊆ Ji L,
y ∈ Ji L and X is a minimal cover for y. Note that {a1, a2} is not a minimal cover
for b, since a1 → c1 (while c1 → a1 does not hold) and a1 can be replaced by c1 so
that c1a2 → b holds. Thus, the OD-graph will have only two minimal covers for
b: {a1, c2}, {a2, c1}, and one minimal cover for each of a1, a2: {a2, c1} and {a1, c2},
respectively.

Finally, the D-basis is the implicational form of the OD-graph and consists of
implications: a1 → c1, b → c1, a2 → c2, b → c2, a1c2 → b, a1c2 → a2, a2c1 → b
and a2c1 → a1.

The canonical direct unit basis will have three extra implications: a1a2 → b,
ba1 → a2 and ba2 → a1, which are not included in the D-basis.


An important feature of the D-basis is that it is ordered direct, which means that 
it is iteration-free when a special ordering is imposed on the implications. For the 
D-basis, the ordering only requires that all binary implications x → b precede all
non-binary implications X → c. This property of the basis has an advantage of easy
parsing for the processing of its logical consequences. Such processing shows faster
times than the forward chaining algorithm, which is widely used in industrial logic
programming, or the LINCLOSURE algorithm in databases, see [5], section 7. Thus
it retains the important property of directness of the canonical direct unit basis, 
while having, on average, only a portion of implications from the latter. According 
to the test results in [2], the average number of implications in the D-basis, for 
closure systems on a domain of size 7, are between 0.55 and 0.70 of the size of the 
canonical unit basis. 
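
The ordered direct property can be illustrated on the D-basis listed in Example 1. The sketch below computes the closure of a set in a single pass over the implications, with the binary implications placed first; it assumes the eight implications exactly as listed in Example 1, and the names `D_BASIS` and `ordered_closure` are introduced only for this illustration.

```python
# D-basis of Example 1, binary implications first, then non-binary ones.
D_BASIS = [
    ({"a1"}, "c1"), ({"b"}, "c1"), ({"a2"}, "c2"), ({"b"}, "c2"),
    ({"a1", "c2"}, "b"), ({"a1", "c2"}, "a2"),
    ({"a2", "c1"}, "b"), ({"a2", "c1"}, "a1"),
]

def ordered_closure(basis, start):
    """Closure of `start`, obtained in ONE pass over the ordered basis.

    Each implication is applied at most once, in the given order; no
    repeated iteration over the basis is performed.
    """
    closed = set(start)
    for premise, conclusion in basis:
        if premise <= closed:
            closed.add(conclusion)
    return closed
```

For instance, one pass on {a1, a2} adds c1 and c2 through the binary part and then b through a1c2 → b, which agrees with the implication a1a2 → b of the canonical direct unit basis, although that implication itself is not in the D-basis.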


6. Galois lattice 

6.1. Support function and concepts. Many data sets in computer science are 
presented in the form of binary tables. By a finite binary table we understand a
relation R ⊆ U × A, for the set of objects U (rows of the table) and set of attributes
A (columns of the table). If r = (u, a) ∈ R, then the position in row u and column
a is marked by 1. This can be interpreted as meaning that object u possesses
attribute a. Otherwise, the position is marked with a 0.

This table represents a Galois connection, and allows us to form the corresponding
Galois lattice. In order to define the Galois lattice, one needs to define the
support function on each of the sets 2^A and 2^U with respect to R.

S_A : 2^A → 2^U is called a support function on 2^A if, for every X ⊆ A, S_A(X) =
{y ∈ U : (y, x) ∈ R, for all x ∈ X}. Similarly, the support function S_U : 2^U → 2^A
is defined for all Y ⊆ U as S_U(Y) = {x ∈ A : (y, x) ∈ R, for all y ∈ Y}. One may
use the symbol S for notation of both S_A and S_U, since it is usually clear from the
context which one should be applied.

The pair (X, Y) ∈ 2^A × 2^U is called a concept of the relation R, if Y = S(X)
and X = S(Y). It is easy to show that if (X1, Y1) and (X2, Y2) are two concepts
with X1 ⊆ X2, then Y2 ⊆ Y1. Thus, the set of all concepts can be ordered with
respect to the set containment order on their first components, or with respect to
containment order of their second component, producing two partially ordered sets
that are dual to each other.

Moreover, each of these partially ordered sets actually forms a lattice, L_R or,
respectively, its dual, and one may refer to one or the other (depending on the
preferences between sets U or A) as the Galois lattice (concept lattice in FCA) of the
relation R.

Since the structure of the Galois lattice L_R is fully determined by its first
component (or its dual lattice is fully determined by its second component), one can
establish an isomorphism between the Galois lattice and the lattice of closed
sets of a closure operator defined on 2^A by means of the support function.
Indeed, it is straightforward to show that the operator φ_A : 2^A → 2^A defined as
φ_A(X) = S_U(S_A(X)) for X ∈ 2^A is, in fact, a closure operator on A, and the
closed sets with respect to φ_A are exactly the first components of the concepts of
R. Thus, Cl(A, φ_A) ≅ L_R, where Cl(A, φ_A) denotes the lattice of closed sets of a
closure system (A, φ_A).
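
The support functions and the closure operator φ_A can be sketched directly from the definitions. The relation R below is a hypothetical three-object table, stored as a set of (object, attribute) pairs, and the function names mirror the notation of this section only for readability.

```python
from itertools import combinations

# Hypothetical table: object 1 has {a, b}, object 2 has {b, c}, object 3 has {a, b, c}.
U = {1, 2, 3}
A = {"a", "b", "c"}
R = {(1, "a"), (1, "b"), (2, "b"), (2, "c"), (3, "a"), (3, "b"), (3, "c")}

def S_A(R, U, X):
    """Objects possessing every attribute of X."""
    return {u for u in U if all((u, x) in R for x in X)}

def S_U(R, A, Y):
    """Attributes shared by every object of Y."""
    return {a for a in A if all((u, a) in R for u in Y)}

def phi_A(R, U, A, X):
    """Closure of an attribute set X: S_U(S_A(X))."""
    return S_U(R, A, S_A(R, U, X))

def closed_sets(R, U, A):
    """All closed attribute sets, by brute force over the subsets of A."""
    found = set()
    for r in range(len(A) + 1):
        for X in combinations(sorted(A), r):
            found.add(frozenset(phi_A(R, U, A, set(X))))
    return found
```

In this table the closure of {a} is {a, b}, and the closed sets form the four-element lattice {b}, {a, b}, {b, c}, {a, b, c}.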

6.2. Reductions of the table. There is a well-developed procedure for reducing
the given relation R to a relation R' ⊆ U' × A', where U' ⊆ U, A' ⊆ A and
R' is the restriction of R to U' × A', so that |U'| and |A'| are minimal with respect
to the property L_{R'} ≅ L_R. See, for example, section 2 of [5], for the general
theoretical outline, and section XI.3 in [18] for algorithmic details. In other words,
one may leave only essential objects and attributes in the relation and remove the
others. Any analysis can be done on the smaller table, while easily extending the
outcome to the original sets of objects and attributes. In fact, every removed element
a ∈ A \ A' will be associated with some X_a ⊆ A' such that φ_A(a) = φ_A(X_a).

When the procedure of reduction is completed and one obtains the reduced table
R', one can establish an important relationship between the elements of the set of
objects U', the set of attributes A', and the special subset of elements in the Galois
lattice L_{R'}. Namely, the set A' can be interpreted as the set of join irreducible
elements Ji L_{R'} of the Galois lattice and U' as the set of meet irreducible
elements Mi L_{R'}.² Moreover, an element 1 at the intersection of row i and column
j is equivalent to j ≤ i in the lattice L_{R'}.

6.3. Arrow relations in the table. We may assume that after the reduction the
table has n rows (objects) and m columns (attributes). The binary table allows
one to quickly recover additional information by:

(1) Establishing a partial order (U', ≤) = (Mi L_{R'}, ≤);

(2) Establishing a partial order (A', ≤) = (Ji L_{R'}, ≤);

(3) Establishing the arrow relations ↑, ↓ and ↕.

Recall that for i ∈ Ji L_{R'} and j ∈ Mi L_{R'}, i ↑ j is defined to hold iff j is a maximal
element among elements of L_{R'} that are not greater than i. (In lattice terms, it is
equivalent to: i ∨ j = j*, where j* is the unique upper cover of j.)

Dually, i ↓ j iff i is a minimal element among elements of L_{R'} that are not less
than j. (This is equivalent to: i ∧ j = i_*, where i_* is the unique lower cover of i.)
Finally, i ↕ j, if both i ↑ j and i ↓ j hold.

It is clear that algorithmically, the reconstruction of arrow relations is equivalent
to finding maximal or minimal elements in particular partially ordered sets, which
are sub-posets of either (U', ≤) or (A', ≤). So there is a straightforward process to
recover these relations. Thus, we may assume that the reduced table of the lattice L_{R'}
is given equipped with the arrow relations.
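
One possible way to carry out this recovery is sketched below. It relies on a reading of the definitions that is an assumption of this sketch: with a 1 in row u and column a meaning a ≤ u, the order on U' is containment of rows, the order on A' is reverse containment of columns, a ↑ u holds when the cell (u, a) is 0 and a occurs in every row strictly containing row u, and a ↓ u holds when the cell is 0 and row u contains every attribute whose column strictly contains column a. The example table is the (hypothetical) reduced table of a three-element chain.

```python
def arrows(objects, attributes, R):
    """Arrow relations of a reduced table R, given as a set of (object, attribute) pairs.

    Returns three sets of pairs (a, u): up-arrows, down-arrows, double arrows.
    """
    row = {u: frozenset(a for a in attributes if (u, a) in R) for u in objects}
    col = {a: frozenset(u for u in objects if (u, a) in R) for a in attributes}
    up, down = set(), set()
    for a in attributes:
        for u in objects:
            if (u, a) in R:
                continue  # arrows can only sit on zero cells
            # a is below every meet irreducible strictly above u
            if all((v, a) in R for v in objects if row[u] < row[v]):
                up.add((a, u))
            # every join irreducible strictly below a is below u
            if all((u, c) in R for c in attributes if col[a] < col[c]):
                down.add((a, u))
    return up, down, up & down

# Reduced table of the chain 0 < a < 1: join irreducibles a, t (= top),
# meet irreducibles u0 (= bottom), ua (= a); the only 1 is in cell (ua, a).
OBJECTS = ["u0", "ua"]
ATTRIBUTES = ["a", "t"]
R = {("ua", "a")}
```

In this small table both arrows hold for the pairs (a, u0) and (t, ua), while the zero cell (u0, t) carries no arrow.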

6.4. Implicational basis. The Galois lattice can be fully determined by a set
of implications defined on the set A', or dually, on the set U'. By definition, an
implication on the set A' is an ordered pair (X, Y) ∈ 2^{A'} × 2^{A'} with X, Y ≠ ∅.
Very often the implication (X, Y) is written in the form X → Y. A subset Z ⊆ A'
respects an implication X → Y, if whenever X ⊆ Z one also has Y ⊆ Z. If Σ is
a set of implications, then Z respects Σ whenever Z respects every implication σ
from Σ.

There exists a classical connection between closure operators defined on A' and 
sets of implications on A': 

(1) every set of implications Σ defines a closure operator φ_Σ by setting φ_Σ(Y)
as the smallest superset of Y that respects Σ;

(2) every closure operator φ can be defined by some set of implications Σ such
that the φ-closed sets are exactly the sets that respect Σ.
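
Item (1) can be sketched by the usual forward chaining: implications are applied repeatedly until nothing changes, which yields the smallest superset respecting all of them. The set of implications `SIGMA` below is hypothetical, and the function names are introduced only for this illustration.

```python
# Hypothetical set of implications, each given as (premise, conclusion).
SIGMA = [
    ({"b", "c"}, {"d"}),
    ({"a"}, {"b"}),
]

def respects(Z, implications):
    """True if Z respects every implication: premise inside Z forces conclusion inside Z."""
    return all(not premise <= Z or conclusion <= Z
               for premise, conclusion in implications)

def closure(implications, Y):
    """Smallest superset of Y that respects all implications (forward chaining)."""
    closed = set(Y)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in implications:
            if premise <= closed and not conclusion <= closed:
                closed |= conclusion
                changed = True
    return closed
```

With the implications in this order, the closure of {a, c} needs two productive passes: the first adds b, and only then does bc → d apply.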

While every set of implications defines the closure operator uniquely, there are 
multiple possibilities to define a set of implications for any given operator. A set of 

²It is a matter of taste which of the two Galois lattices, dual to each other, one chooses.
For that matter, the set A' may be associated with the set of meet irreducibles rather than join
irreducibles, which often happens in publications of FCA.
implications for a given closure operator, satisfying some conditions of minimality,
is usually called an implicational basis of this closure system. One can refer to 
the survey article m for further details on connections between closure operators, 
their closure lattices, and sets of implications. 

For application purposes, most often the implications on A' provide an essential
hidden connection between different attributes in the given data set. For example,
an implication a1a2 → b, where a1, a2, b ∈ A', means that every object in the data
set that possesses both attributes a1 and a2 also possesses attribute b.

6.5. Connection between D-basis and D-relation. As we mentioned earlier,
the name of the D-basis is directly connected to the D-relation defined in a lattice
theory framework. Lemma 2.31 in [18] can be formulated as follows.

Lemma 2. Given two elements b, c ∈ Ji L, the relation bDc holds iff there exists
an implication X → b in the D-basis of L such that c ∈ X.

In particular, for every implication X → b in the D-basis, we have X ⊆ bD =
{x ∈ Ji L : bDx}.


7. Recovery of the D-basis from the table 


The goal of this section is to prove the main result of the paper. 


Theorem 3. Given a reduced table (U', A', R'), R' ⊆ U' × A', one can polynomially
(in the size of the table) reduce the problem of recovery of the D-basis to (parallel)
solution of the hypergraph dualization problem formed for each b ∈ A'.


Proof. Recall from section 6.3 that the set A' can be interpreted as the set of join
irreducible elements Ji L of the Galois lattice L. The key observation for the recovery
of the D-basis is contained in [18, Lemma 11.10] that connects the arrow relations
of the reduced table with the D-relation on the set of join irreducible elements of
its Galois lattice:


bDc iff b ↑ p and c ↓ p, for some p ∈ Mi L.

This allows us to recover effectively the sets bD = {c ∈ A' : bDc}, for every b ∈ Ji L.

According to Lemma 2 in section 6.5, for every fixed b, every implication of the
D-basis of the form X → b will satisfy X ⊆ bD = {x ∈ Ji L : bDx}.

Another important subset associated with each b is M(b) = {m ∈ Mi L : b ↑ m}.
Recall that these are the maximal elements in L which are not greater than b. The
following statement is a simple lattice theoretical observation.

Claim. For every b ∈ Ji L and Y ⊆ Ji L, Y → b holds in L iff for every m ∈ M(b)
there exists y ∈ Y such that y ≰ m.


This observation reduces the problem of finding the D-basis to the problem of 
finding the minimal transversal sets of a particular Sperner hypergraph. 

Let us first consider the general setting of an optimization problem which has a
standard reduction to the hypergraph dualization problem. Given any set X and a
family ℳ = {M1, ..., Mk} ⊆ 2^X of its subsets, which, we may assume, are pairwise
incomparable, consider the order ideal J generated by this family. In other words,
J = {Y ⊆ X : Y ⊆ Mi, for some i ≤ k}. Evidently, the family of subsets
2^X \ J = {Z ⊆ X : Z ∉ J} forms an order filter F in the poset 2^X. A well-known
optimization problem asks to find all the minimal elements of F, i.e., sets Y1, ..., Ys
such that F = {Z ⊆ X : Yi ⊆ Z for some i ≤ s}.

The standard reduction to the hypergraph dualization is done by defining the
hypergraph (X, H), where H = {M1^c, ..., Mk^c}, Mi^c = X \ Mi. Evidently, any
transversal T of this hypergraph has at least one element from Mi^c for each i ≤ k,
so that T does not belong to the ideal J generated by M1, ..., Mk. Therefore,
it belongs to 2^X \ J = F. Vice versa, any element Y ∈ F cannot be a subset
of any Mi, i ≤ k, hence, Y ∩ Mi^c ≠ ∅. Therefore, Y must be a transversal of
the hypergraph (X, H). In conclusion, finding the minimal elements of F is
equivalent to finding the minimal transversals of (X, H).
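
The equivalence can be checked by brute force on a small instance. The sketch below computes, for a hypothetical ground set and family of subsets, both the minimal sets outside the ideal and the minimal transversals of the hypergraph of complements; all names are assumptions made for this illustration, and the exhaustive search is not a substitute for a dualization algorithm.

```python
from itertools import combinations

def minimal_sets(X, predicate):
    """Minimal subsets of X satisfying an upward-closed predicate (brute force).

    Subsets are visited by increasing size; a subset is kept only if it
    contains no previously found set.
    """
    found = []
    for r in range(len(X) + 1):
        for cand in combinations(sorted(X), r):
            s = frozenset(cand)
            if not any(f <= s for f in found) and predicate(s):
                found.append(s)
    return set(found)

def minimal_filter_elements(X, family):
    """Minimal sets that are not contained in any member of the family."""
    return minimal_sets(X, lambda s: not any(s <= M for M in family))

def minimal_transversals_of_complements(X, family):
    """Minimal transversals of the hypergraph with edges X \\ M, M in the family."""
    edges = [frozenset(X) - M for M in family]
    return minimal_sets(X, lambda s: all(s & e for e in edges))

# Hypothetical instance: X = {1, 2, 3, 4} with pairwise incomparable subsets.
X = {1, 2, 3, 4}
FAMILY = [frozenset({1, 2}), frozenset({2, 3}), frozenset({4})]
```

Both computations return the same four sets {1, 3}, {1, 4}, {2, 4} and {3, 4} on this instance.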

We will proceed by setting an instance of the optimization problem above, for each
b ∈ A'. Let X = bD = {c ∈ A' : bDc}. Consider the family of subsets ℳ = {M_m =
bD ∩ [0, m] : m ∈ M(b)}. Here 0 stands for the smallest element of the lattice L
and [0, m] = {x ∈ L : x ≤ m}. For each m, this is equivalent to taking only those
elements from bD that are in the relation R' with m.

Let $\mathcal{J}$ be the order ideal in $2^X = 2^{b^D}$ generated by the family $\mathcal{M}$. According to
the Claim, we need to find subsets $Y \subseteq b^D$ that do not belong to $\mathcal{J}$. Thus, finding
the minimal such sets $Y$ is equivalent to an instance of the optimization problem
we discussed above.

Note that, in comparison with the method of [32], we build a hypergraph on $b^D$
rather than on the whole $A'$, which reduces the hypergraph size.

The hypergraph dualization problem would search for the minimal elements of
$\mathcal{F} = 2^{b^D} \setminus \mathcal{J}$, where $\mathcal{J}$ is the order ideal in $2^{b^D}$ generated by the family $\mathcal{M}$.

If $Y_1, Y_2, \dots, Y_s$ are such minimal elements, then the implications $Y_i \to b$,
$i = 1, \dots, s$, satisfy two properties:

• $Y_i$ is a minimal subset $X$ such that $X \to b$ holds;

• $Y_i \subseteq b^D$.

In particular, all these implications belong to the canonical unit basis, and every
implication from the D-basis of the form $X \to b$ is included in this list. The full
D-basis will be recovered when this process is applied to all $b^D \neq \emptyset$, $b \in A'$.

As a result, the problem of recovering the D-basis is polynomially reduced to at most
$t$ runs of the dualization algorithm on $2^{X_i}$, $X_i \subseteq A'$, where $t = |A'|$ is the number
of attributes. It is possible, however, that some of the recovered implications
are not in the D-basis, so the recovered set of implications may contain the D-basis
properly. □


8. Example 

Let us consider the procedure on a small table with 6 objects and 7 attributes. 
All symbols in the table other than 1 should first be interpreted as 0s.

Here $U = \{1, 2, 3, 4, 5, 6\}$, $A = \{b, a_1, a_2, c_1, c_2, u, v\}$. Since $S(c_1) = S(u)$,
attribute $u$ can be reduced. Attribute $v$ also can be reduced, because $S(S(v)) = A$,
and $A \setminus v$ is not a first component of any concept. One can also reduce
object 5 due to $S(5) = S(4)$, and reduce object 6 because $6 \in S(S(i))$, for every
$i \in \{1, 2, 3, 4, 5\}$. Thus, one can consider the reduced table with $U' = \{1, 2, 3, 4\}$,
$A' = \{b, a_1, a_2, c_1, c_2\}$, and complement the basis (on $A$) of a reduced table by
implications $c_1 \to u$, $u \to c_1$, $v \to A \setminus v$.

The order relation on $A'$ consists of $c_1 \leq a_1, b$ and $c_2 \leq a_2, b$. The order relation
on $U'$ consists of $2 > 4$. This allows us to determine the arrow relations between
elements of $A'$ and $U'$, which are shown in the table. The Galois lattice $L$ of the
reduced table is shown in Fig. 1. Note that objects of the reduced table correspond
to the following (meet irreducible) elements of the lattice: $1 = a_1$, $2 = b$, $3 = a_2$
and $4 = c_1 \vee c_2$.



Figure 1. Galois lattice from the table 

In order to recover all implications $X \to b$, for some $X \subseteq A'$, first identify the
elements of $b^D = \{a_1, a_2, c_1, c_2\}$. Also, $M(b) = \{1, 3, 4\}$. Taking any $m \in M(b)$,
we find the corresponding subsets of the family $\mathcal{M}$: $M_1 = \{a_1, c_1\}$, $M_3 = \{a_2, c_2\}$,
$M_4 = \{c_1, c_2\}$. This is done by picking the support of each element $m \in M(b)$ within
$b^D$. The hypergraph dualization problem finds the minimal elements in $2^{b^D} \setminus \mathcal{J}$,
where $\mathcal{J}$ is the order ideal in $2^{b^D}$ generated by the family $\mathcal{M} = \{M_1, M_3, M_4\}$.
Evidently, these will be $Y_1 = \{a_1, c_2\}$, $Y_2 = \{a_2, c_1\}$, and $Y_3 = \{a_1, a_2\}$. This gives
all implications from the basis with the conclusion $b$: $a_1 c_2 \to b$, $a_2 c_1 \to b$ and
$a_1 a_2 \to b$.
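
The computation in this example can be replayed mechanically. The Python sketch below is an illustration under stated assumptions: the attributes are written as the strings a1, a2, c1, c2, the three sets indexed by 1, 3, 4 are taken as given in the text, and the helper name is hypothetical. The exhaustive search stands in for the dualization subroutine and is not the implementation discussed in this paper.

```python
from itertools import combinations

def minimal_transversals(ground, edges):
    """Minimal subsets of `ground` meeting every edge (exhaustive search)."""
    items = sorted(ground)
    found = []
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            cand = frozenset(combo)
            # Smaller candidates come first, so supersets of found sets are skipped.
            if all(cand & e for e in edges) and not any(t <= cand for t in found):
                found.append(cand)
    return found

# The set of elements D-related to b, and the sets M_m for m in M(b) = {1, 3, 4}.
bD = frozenset({"a1", "a2", "c1", "c2"})
family = {
    1: frozenset({"a1", "c1"}),
    3: frozenset({"a2", "c2"}),
    4: frozenset({"c1", "c2"}),
}

# Hyperedges are the complements of the M_m inside bD.
edges = [bD - m for m in family.values()]
premises = minimal_transversals(bD, edges)

for Y in premises:
    print(" ".join(sorted(Y)), "-> b")
# a1 a2 -> b
# a1 c2 -> b
# a2 c1 -> b
```

By the Claim, each premise found this way is contained in none of the three sets, which can be confirmed with `all(not (Y <= m) for Y in premises for m in family.values())`.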

We note that the last implication is not in the D-basis and one can use the
algorithm from [2] to remove it.

We note that the reduction of the retrieved basis can be considerable,
and the size of the reduced part may depend on the number of binary implications
in the basis, i.e., implications of the form $x \to y$. For example, one of the tests
on a small matrix of size $10 \times 22$ found 9 binary implications in the basis, and the
D-basis had a total of 635 implications, after 212 implications were removed in the
last phase of the algorithm. No reduction will occur if the binary part of the basis
is empty: in this rare case the canonical direct basis and the D-basis coincide.


9. Results of initial testing

The algorithm of D-basis recovery described in section was implemented in
C++ code by a team of undergraduate students of Yeshiva University. They were
able to implement the call to an existing subroutine that performs the hypergraph
dualization, and which is publicly available via a repository maintained by T. Uno:
http://research.nii.ac.jp/~uno/dualization.html

Further optimizations of the code were done in collaboration with T. Uno and
U. Norbisrath. The retrieval of the full implicational basis for a random matrix
of size $20 \times 40$ and density 0.2 (20% of entries of the matrix are ones) took 0.27 sec, for
1616 implications. A similar test in U. Ryssel et al. [32] mentioned in section
showed 0.9 sec for 1476 implications.

Further tests with the new code were done on larger random matrices. On a
table of size 50-by-100, the D-basis with 49,000 implications was obtained in 3
min 30 sec. All the implications $X \to b$, for a requested column $b$ of a randomly
generated matrix of size 50-by-200, were obtained in 25 min.

For the initial comparisons with the Apriori algorithm, we ran two types of tests.

The first data set was taken from the Frequent Itemset Mining Dataset Repository,
publicly available at http://fimi.ua.ac.be/data/retail.dat. It was retail market
basket data from an anonymous Belgian retail store. We took the first 90 rows,
converting them to a binary matrix format with size 90-by-502 and (low) density
0.0162. Running time was about 42 sec resulting in 104 mostly binary implications
of maximal support 5 and confidence = 1. For the Apriori, we used Microsoft SQL 
Server Business Intelligence Development Studio, 2008 (Data Mining Technique - 
Microsoft Association Rules) jM] HD] . The result of this run, together with setup 
of the input, took about 4 min 30 sec, with a considerably larger set of association 
rules, most of which have confidence < 1. We point out that Apriori was designed
specifically for mining association rules in retail data, and specifically for data
that is presented by large sets of item-sets. When converted to matrix form, it
usually contains a large number of rows (transactions) with comparatively few columns
(number of items on sale), and normally has low density.
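
The conversion from item-set data to a binary table, and the density figure quoted above, can be sketched as follows. This Python fragment is an illustration under assumptions: it supposes that each transaction is one line of whitespace-separated integer item identifiers, and the function names are hypothetical. It is not the preprocessing code used for the tests reported here.

```python
def transactions_to_matrix(lines, max_rows=None):
    """Turn item-set lines into a 0/1 matrix with one column per distinct item."""
    rows = [set(map(int, line.split())) for line in lines if line.strip()]
    if max_rows is not None:
        rows = rows[:max_rows]          # e.g. keep only the first 90 transactions
    items = sorted(set().union(*rows)) if rows else []
    column = {item: j for j, item in enumerate(items)}
    matrix = [[0] * len(items) for _ in rows]
    for i, row in enumerate(rows):
        for item in row:
            matrix[i][column[item]] = 1
    return matrix, items

def density(matrix):
    """Fraction of entries equal to 1."""
    cells = sum(len(row) for row in matrix)
    return sum(map(sum, matrix)) / cells if cells else 0.0

# Hypothetical toy input: three transactions over four items.
matrix, items = transactions_to_matrix(["0 1 2", "1 3", "2"])
print(items)            # [0, 1, 2, 3]
print(density(matrix))  # 0.5
```

Only items that occur in the selected rows receive a column, so the number of columns depends on how many rows are kept.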

Our second data set was of a different nature, where we believe our novel approach
may have an edge over Apriori.

The data set was kindly provided by the research group of Dr. G. Okimoto from
the University of Hawaii Cancer Center. It contained the gene expression levels for
550 pre-selected genes, in 22 patients, some healthy and others with liver cancer.
There are multiple ways to convert this matrix into binary format. For
comparison of the D-basis algorithm with Apriori, one of them was taken as test data of
size 22-by-1112 and density 0.4684. Columns 1111 and 1112 represent the attributes
of being healthy and having cancer, respectively.

Microsoft Association Rules output was restricted to only 2000 frequent sets with 
minimum support 15 and confidence = 1, and there were no association rules with 
attributes 1111 and 1112. On the other hand, with the D-basis code we were able 
to reveal the equivalence of attribute 1111 to the set of 9 other attributes, and to 
find 14819 implications with the target attribute 1112, whose support = 8 and the 
confidence at least 21/22.


10. Further theoretical work and implementation

Recovery of the implicational basis directly relates to knowledge discovery, especially
if one is concerned with the targeted set of implications in the basis. Additional
testing of biological data provided by the collaborators at the University of
Hawaii Cancer Center and a medical group in Astana, Kazakhstan, was aimed at
data of a larger size, and at the retrieval of specific targeted association rules. The
results of this testing were recently presented at the FCA conference [T]. 

From a theoretical point of view, the future plans include the generalization of 
the algorithm for the retrieval of association rules with the threshold of confidence 
/I < 1 (see section]^, and here some initial results are achieved, which may be 
implemented in the code. This will allow us to take into account the possibility 
of incomplete and erroneous data. On the other hand, a different approach exists 
which allows us to utilize the existing implementation. Namely, the algorithm can 
be run multiple times on the proper subsets of existing sets of rows. For example, 
if the original data contains 100 rows, then the algorithm can be run on all subsets 
of 95 rows, which will result in obtaining all association rules of confidence at least 
95%. This can be achieved easily with the assistance of a specialist in distributed 
computing. 
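
The row-subset idea can be illustrated on a toy table. In the Python sketch below, rows are sets of attributes and all names are hypothetical; it only tests a given implication and does not compute a basis. It also records a simple observation, offered as an aside rather than as part of the method: an implication holds on some subset of n - k rows exactly when at most k rows of the full table violate it, so for a single rule the subsets need not be enumerated.

```python
from itertools import combinations

def holds(rows, premise, conclusion):
    """True if every row containing `premise` also contains `conclusion`."""
    return all(conclusion in row for row in rows if premise <= row)

def holds_on_some_subset(rows, premise, conclusion, size):
    """Enumerate all row subsets of the given size (feasible for tiny tables only)."""
    return any(holds(sub, premise, conclusion)
               for sub in combinations(rows, size))

def violations(rows, premise, conclusion):
    """Number of rows that contain `premise` but lack `conclusion`."""
    return sum(1 for row in rows if premise <= row and conclusion not in row)

# Hypothetical toy table with five rows over the attributes a, b, c.
rows = [frozenset("abc"), frozenset("ab"), frozenset("abc"),
        frozenset("bc"), frozenset("ac")]
premise = frozenset("ab")

n, k = len(rows), 1
by_enum = holds_on_some_subset(rows, premise, "c", n - k)
by_count = violations(rows, premise, "c") <= k
print(by_enum, by_count)  # True True
```

Here the rule with premise {a, b} and conclusion c fails on the full table because of one row, and holds once that row is left out.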

As for the goal of the current paper, the theoretical results on D-basis recovery
are supported by the practical evidence tested in [32] and initial testing presented 
in section that the recovery of the implicational basis based on the hypergraph 
dualization algorithm will bring a considerable cut in run-times when dealing with 
tabled data. This will make the new approach critical for analysis of large data 
sets. 

Acknowledgments. The results of this paper were prompted by the discussion of 
the hypergraph dualization algorithm with E. Boros and V. Gurvich, at the RUTCOR
seminar during the first author’s visit in 2011. The test results of section
were possible due to code implementation done by J. Blumenkopf and T. Moldwin
(Yeshiva College, New York), assistance of T. Uno (National Institute of Informatics,
Tokyo), U. Norbisrath and A. Amanbekkyzy (Nazarbayev University, Astana),
and data provided by G. Okimoto (University of Hawaii, Honolulu).